class: center, middle, inverse, title-slide

.title[
# Module 2: Data Wrangling
]
.subtitle[
## Introduction to Tools of the Trade in Data Analysis
]
.author[
### Dr. Christopher Kenaley
]
.institute[
### Boston College
]
.date[
### 2024/9/9
]

---

class: inverse, top
# In class today


<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.14.0/css/all.min.css">

.pull-left[
Today we'll:

- Review/Learn about the pipe: `%>%`

- Load some data

- Perform some tidy operations

- Peek under the hood of Module Project 2

]

.pull-right[
![](http://www.alaskapublic.org/wp-content/uploads/2013/09/trans-alaska-pipeline-dnr.jpg)
]

---
class: inverse, top

## What is the pipe (`%>%`)?

- comes from the `magrittr` package

- loaded automatically with the `tidyverse` meta-package

- makes code concise:
  * streamlines many operations into fewer lines of code
  
  * reduces repetitive tasks

``` r
iris <- group_by(iris,Species)
summarise(iris,mean_length=mean(Sepal.Length))
```

```
## # A tibble: 3 × 2
##   Species    mean_length
##   <fct>            <dbl>
## 1 setosa            5.01
## 2 versicolor        5.94
## 3 virginica         6.59
```
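
As a minimal sketch of what the pipe does: `x %>% f(y)` is evaluated as `f(x, y)`, so the left-hand side becomes the first argument of the call on the right. The example uses base R functions and the built-in `iris` data; the names `nested` and `piped` are only illustrative.

```r
library(magrittr)  # provides %>%; also available after library(tidyverse)

# Nested call: read from the inside out
nested <- round(mean(iris$Sepal.Length), 2)

# Piped call: read from left to right
piped <- iris$Sepal.Length %>%
  mean() %>%
  round(2)

# Both forms run the same computation
identical(nested, piped)
```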

---
class: inverse, top

## What is the pipe (`%>%`)?

``` r
iris <- group_by(iris,Species)
summarise(iris,mean_length=mean(Sepal.Length))
```

``` r
iris%>%
  group_by(Species)%>%
  summarize(mean_length=mean(Sepal.Length))
```

```
## # A tibble: 3 × 2
##   Species    mean_length
##   <fct>            <dbl>
## 1 setosa            5.01
## 2 versicolor        5.94
## 3 virginica         6.59
```

---

## What is the pipe (`%>%`)?

- the benefit is more apparent when plotting (a major piece of data science)

``` r
iris <- group_by(iris,Species)
iris_mean <- summarise(iris,mean_length=mean(Sepal.Length))

ggplot(data=iris_mean,aes(x=Species,y=mean_length))+geom_bar(stat="identity")
```

``` r
iris%>%
  group_by(Species)%>%
  summarize(mean_length=mean(Sepal.Length))%>%
  ggplot(aes(x=Species,y=mean_length))+geom_bar(stat="identity")
```

![](3140_f24_9-9_files/figure-html/unnamed-chunk-6-1.png)
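
One thing to watch in the piped version: the pipe hands the summarized data to `ggplot()`, but layers are still added with `+`, not `%>%`. The sketch below repeats the plot, assuming `dplyr` and `ggplot2` are attached (both come with `tidyverse`). It swaps in `geom_col()`, which takes bar heights from the `y` values just as `geom_bar(stat="identity")` does. The name `p` is only for illustration.

```r
library(dplyr)
library(ggplot2)

p <- iris %>%
  group_by(Species) %>%
  summarize(mean_length = mean(Sepal.Length)) %>%  # data flows through %>%
  ggplot(aes(x = Species, y = mean_length)) +      # layers are added with +
  geom_col()                                       # bar height = mean_length

p
```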

---

## Loading data

- the `readr` package has several handy functions

- `read_csv()` is the handiest

``` r
d <- read_csv("https://bcorgbio.github.io/class/data/coyote.csv")
head(d)
```

```
## # A tibble: 6 × 10
##   id      species       Region    state  County Town  Locality   Lat  Long  Year
##   <chr>   <chr>         <chr>     <chr>  <chr>  <chr> <chr>    <dbl> <dbl> <dbl>
## 1 adk2706 Canis latrans northeast New Y… <NA>   <NA>  <NA>      43.8 -75.0  2007
## 2 adk2798 Canis latrans northeast New Y… <NA>   <NA>  <NA>      43.9 -74.7    NA
## 3 adk2801 Canis latrans northeast New Y… <NA>   <NA>  <NA>      43.9 -74.7    NA
## 4 adk2833 Canis latrans northeast New Y… <NA>   <NA>  <NA>      43.9 -74.8    NA
## 5 adk2845 Canis latrans northeast New Y… <NA>   <NA>  <NA>      42.8 -73.8    NA
## 6 adk2853 Canis latrans northeast New Y… Herkm… Lost… Lost Cr…  43.8 -75.0  2002
```
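
To see how `read_csv()` handles column types and missing values without relying on the network, here is a small sketch that parses a few made-up rows supplied as literal text. The rows and the name `toy` are hypothetical, not taken from the coyote data, and wrapping the text in `I()` assumes readr 2.0 or later.

```r
library(readr)

# Made-up rows for illustration only; the empty Year field is read as NA
toy <- read_csv(I("id,state,Year
a1,New York,2007
a2,New York,
a3,Maine,2002"), show_col_types = FALSE)

toy              # id and state are guessed as character, Year as double
is.na(toy$Year)  # flags the row with the missing Year
```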

---

## Loading data

``` r
d%>%
  group_by(state)%>%
  dplyr::summarize(n=n())%>%
  ggplot(aes(x=state,y=n))+geom_bar(stat="identity")+coord_flip()
```

![](3140_f24_9-9_files/figure-html/unnamed-chunk-8-1.png)
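
The `group_by()` plus `summarize(n=n())` pattern is common enough that `dplyr` also provides `count()`, which should give the same per-group tally in one step. A sketch on a small made-up tibble (the name `toy` and its values are hypothetical stand-ins for `d`):

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(state = c("New York", "Maine", "New York",
                        "Vermont", "Maine", "New York"))

# Long form: group, then tally rows per group
long <- toy %>%
  group_by(state) %>%
  summarize(n = n())

# Short form: count() groups and tallies in one call
short <- toy %>%
  count(state)

all.equal(long, short)
```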